Learning Goals

Lab Description

We will work with two Starbucks datasets, one on the store locations (global) and one for the nutritional data for their food and drink items. We will do some text analysis of the menu items.

Deliverables

Upload an html file to Quercus and make sure the figures remain interactive.

Steps

0. Install and load libraries

1. Read in the data

  • There are 4 datasets to read in, Starbucks locations, Starbucks nutrition, US population by state, and US state abbreviations. All of them are on the course GitHub.
sb_locs <- read_csv("https://raw.githubusercontent.com/JSC370/JSC370-2025/refs/heads/main/data/starbucks/starbucks-locations.csv")
## Rows: 25600 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Store Number, Store Name, Ownership Type, Street Address, City, St...
## dbl  (2): Longitude, Latitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sb_nutr <- read_csv("https://raw.githubusercontent.com/JSC370/JSC370-2025/refs/heads/main/data/starbucks/starbucks-menu-nutrition.csv")
## Rows: 205 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Item, Category
## dbl (5): Calories, Fat (g), Carb. (g), Fiber (g), Protein (g)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
usa_pop <- read_csv("https://raw.githubusercontent.com/JSC370/JSC370-2025/refs/heads/main/data/starbucks/us_state_pop.csv")
## Rows: 55 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): state
## dbl (1): population
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
usa_states<-read_csv("https://raw.githubusercontent.com/JSC370/JSC370-2025/refs/heads/main/data/starbucks/states.csv")
## Rows: 51 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): State, Abbreviation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Look at the data

  • Inspect each dataset to look at variable names and ensure it was imported correctly.
head(sb_locs)
## # A tibble: 6 × 12
##   `Store Number` `Store Name`            `Ownership Type` `Street Address` City 
##   <chr>          <chr>                   <chr>            <chr>            <chr>
## 1 47370-257954   Meritxell, 96           Licensed         Av. Meritxell, … Ando…
## 2 22331-212325   Ajman Drive Thru        Licensed         1 Street 69, Al… Ajman
## 3 47089-256771   Dana Mall               Licensed         Sheikh Khalifa … Ajman
## 4 22126-218024   Twofour 54              Licensed         Al Salam Street  Abu …
## 5 17127-178586   Al Ain Tower            Licensed         Khaldiya Area, … Abu …
## 6 17688-182164   Dalma Mall, Ground Flo… Licensed         Dalma Mall, Mus… Abu …
## # ℹ 7 more variables: `State/Province` <chr>, Country <chr>, Postcode <chr>,
## #   `Phone Number` <chr>, Timezone <chr>, Longitude <dbl>, Latitude <dbl>
head(sb_nutr)
## # A tibble: 6 × 7
##   Item         Category Calories `Fat (g)` `Carb. (g)` `Fiber (g)` `Protein (g)`
##   <chr>        <chr>       <dbl>     <dbl>       <dbl>       <dbl>         <dbl>
## 1 Chonga Bagel Food          300         5          50           3            12
## 2 8-Grain Roll Food          380         6          70           7            10
## 3 Almond Croi… Food          410        22          45           3            10
## 4 Apple Fritt… Food          460        23          56           2             7
## 5 Banana Nut … Food          420        22          52           2             6
## 6 Blueberry M… Food          380        16          53           1             6
head(usa_pop)
## # A tibble: 6 × 2
##   state      population
##   <chr>           <dbl>
## 1 Alabama       4779736
## 2 Alaska         710231
## 3 Arizona       6392017
## 4 Arkansas      2915918
## 5 California   37253956
## 6 Colorado      5029196
head(usa_states)
## # A tibble: 6 × 2
##   State      Abbreviation
##   <chr>      <chr>       
## 1 Alabama    AL          
## 2 Alaska     AK          
## 3 Arizona    AZ          
## 4 Arkansas   AR          
## 5 California CA          
## 6 Colorado   CO

3. Format and merge the data

  • Subset Starbucks data to the US.
  • Create counts of Starbucks stores by state.
  • Merge population in with the store count by state.
  • Inspect the range values for each variable.
sb_usa <- sb_locs |> filter(Country == 'US')

sb_locs_state <- sb_usa |>
  rename(state = 'State/Province') |>
  group_by(state) |>
  summarize(n_stores = n())

# need state abbreviations
usa_pop_abbr <- 
  full_join(
    usa_pop, usa_states, by=join_by(state == State)
  ) 
  
sb_locs_state <- full_join(
  sb_locs_state, usa_pop_abbr, by=join_by(state == Abbreviation)
)

summary(sb_locs_state)
##     state              n_stores        state.y            population      
##  Length:55          Min.   :   8.0   Length:55          Min.   :   56882  
##  Class :character   1st Qu.:  56.5   Class :character   1st Qu.: 1344331  
##  Mode  :character   Median : 123.0   Mode  :character   Median : 3751351  
##                     Mean   : 266.8                      Mean   : 5677621  
##                     3rd Qu.: 332.0                      3rd Qu.: 6515716  
##                     Max.   :2821.0                      Max.   :37253956  
##                     NA's   :4

4. Use ggplotly for EDA

Answer the following questions:

  • Are the number of Starbucks proportional to the population of a state? (scatterplot)

  • Is the caloric distribution of Starbucks menu items different for drinks and food? (histogram)

  • What are the top 20 words in Starbucks menu items? (bar plot)

p1 <- sb_locs_state |>
  ggplot(aes(x=population, y=n_stores, color=state)) +
  geom_point(alpha=0.8) +
  theme_bw()

ggplotly(p1)
  • Are the number of Starbucks proportional to the population of a state? (scatterplot)
  • 4a) Answer: Yes, from the scatterplot, we see a correlation between number of starbuck stores and the population.
p2 <- sb_nutr |>
  ggplot(aes(x=Calories, fill=Category)) +
  geom_histogram(alpha=0.8) +
  theme_bw()

ggplotly(p2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
  • Is the caloric distribution of Starbucks menu items different for drinks and food? (histogram)
  • 4b) Answer: Yes, we see difference in distribution between food and drinks. Most drinks have less calories than food (the average calories for food is > the mean for drinks).
top20 <- sb_nutr |>
  unnest_tokens(word, Item, token='words') |>
  count(word, sort=T) |>
  head(20)

p3 <- top20 |>
  ggplot(aes(fct_reorder(word, n), n)) +
  xlab('Top 20 Words') +
  ylab('Count') +
  geom_col() + 
  coord_flip() +
  theme_bw()

ggplotly(p3)
  • What are the top 20 words in Starbucks menu items? (bar plot)
  • 4c) Answer: The top words are shown above.

5. Scatterplots using plot_ly()

  • Create a scatterplot using plot_ly() representing the relationship between calories and carbs. Color the points by category (food or beverage). Is there a relationship, and do food or beverages tend to have more calories?
sb_nutr |>
  plot_ly(x=~Calories, y=~`Carb. (g)`,
          type='scatter', mode='markers', color=~Category)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
  • 5a) Answer: Food tends to have more calories than drinks despite having the same amount of carbs. In both categories, we see a correlation between the amount of carbs and calories.

  • Repeat this scatterplot but for the items that include the top 10 words. Color again by category, and add hoverinfo specifying the word in the item name. Add layout information to title the chart and the axes, and enable hovermode = "compare".

  • What are the top 10 words and is the plot much different than above?

top10 <- sb_nutr |>
  unnest_tokens(word, Item, token='words') |>
  group_by(word) |>
  summarise(word_freq = n()) |>
  arrange(across(word_freq, desc)) |>
  head(10)

top10
## # A tibble: 10 × 2
##    word      word_freq
##    <chr>         <int>
##  1 iced             32
##  2 bottled          26
##  3 tazo             26
##  4 sandwich         19
##  5 chocolate        18
##  6 coffee           13
##  7 egg              12
##  8 starbucks        12
##  9 tea              12
## 10 black            11
sb_nutr |>
  unnest_tokens(word, Item, token='words') |>
  filter(word %in% top10$word) |>
  plot_ly(x=~Calories, y=~`Carb. (g)`, color=~Category,
          type='scatter', mode='markers',
          hoverinfo = 'text',
          text = ~paste0('Item: ', word)
        ) |>
  layout(
    title='Cal vs. Carbs',
    xaxis= list(title='Calories'),
    yaxis= list(title='Carbs'),
    hovermode='compare'
  )
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
  • 5b) Answer: The top 10 words are (in descending order):
  1. iced
  2. bottled
  3. tazo
  4. sandwich
  5. chocolate
  6. coffee
  7. egg
  8. starbucks
  9. tea
  10. black

The plot looks identical to the one above.

6. plot_ly Boxplots

  • Create a boxplot of all of the nutritional variables in groups by the 10 item words.
  • Which top word has the most calories? Which top word has the most protein?
sb_nutr_long <- sb_nutr |>
  unnest_tokens(word, Item, token='words') |>
  filter(word %in% top10$word) |>
  pivot_longer(
    cols = c(Calories, `Carb. (g)`, `Protein (g)`, `Fiber (g)`, `Fat (g)`),
    names_to='Nutrient',
    values_to='Value'
  )
  
sb_nutr_long |>
  plot_ly(
    x = ~word,
    y= ~Value,
    color=~Nutrient,
    type='box'
  ) |>
  layout(
    title='Nutrition values for the top 10 word items',
    xaxis=list(title='Item word'),
    yaxis=list(title='Nutrition value'),
    boxmode='group'
  )
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'smith', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'
    1. Answer: The word sandwich has the points with most calories and grams of protein, with values 600 and 32 respectively.

7. 3D Scatterplot

  • Create a 3D scatterplot between Calories, Carbs, and Protein for the items containing the top 10 words
  • Do you see any patterns (clusters or trends)?
sb_nutr |>
  unnest_tokens(word, Item, token='words') |>
  filter(word %in% top10$word) |>
  plot_ly(
    x = ~Calories,
    y= ~`Carb. (g)`,
    z= ~`Protein (g)`,
    color=~word,
    type='scatter3d',
    mode='markers',
    marker=list(size=5)
  ) |>
  layout(
    title='3D scatterplot of Calories, Carbs. and Protein',
    scene = list (
      xaxis=list(title='Calories'),
      yaxis=list(title='Carbs (g)'),
      zaxis=list(title='Protein (g)')
    )
  )
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
    1. Answer: There seems to be positive correlation between calories and carbs and calories and proteins. For carb vs. protein, there is a cluster of high protein, high carb item for the word sandwich.

8. plot_ly Map

  • Create a map to visualize the number of stores per state, and another for the population by state. Add custom hover text. Use subplot to put the maps side by side.
  • Describe the differences if any.
# Set up mapping details
set_map_details <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('steelblue')
)

# Make sure both maps are on the same color scale
shadeLimit <- 125

# Create hover text
sb_locs_state$hover <- with(sb_locs_state, paste("Number of Starbucks: ", n_stores, '<br>', "State: ", state.y, '<br>', "Population: ", population))

# Create the map
map1 <- plot_geo(sb_locs_state, locationmode='USA-states') |>
  add_trace(z = ~n_stores, text= ~hover, locations= ~state, color= ~n_stores, colors='Purples') |>
  layout(title='Starbucks store by state in the US',
         geo=set_map_details)
map1
## Warning: Ignoring 4 observations
map2 <- plot_geo(sb_locs_state, locationmode='USA-states') |>
  add_trace(z = ~population, text= ~hover, locations= ~state, color= ~population, colors='Purples') |>
  layout(title='US Population by state',
         geo=set_map_details)
map2
subplot(map1, map2)
## Warning: Ignoring 4 observations
    1. Answer: There seem to be positive correlation between population and the number of Starbucks store, with the highest number of Starbucks store in California, which is also the state with the highest population count. We also see this trend in Texas (TX), Florida (FL), New York (NY), etc.